Support Mixed precision & Static MSE in MCore; Nemotron Super v3 NVFP4 recipe #1521
jenchen13 wants to merge 8 commits into
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
### What does this PR do?
Type of change: Bug fix
This PR enables `auto_quantize` for Megatron expert parallel MoE flows
by including the expert model parallel group when aggregating scores and
costs and when synchronizing selected recipes. It also derives the
search budget from the no-quant candidate costs in `candidate_stats`, so
sharded expert layers use global candidate costs instead of local module
weights.
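As a rough illustration of the budget derivation described above, the sketch below sums the per-layer no-quant costs and scales the total by the `effective_bits` constraint. The `candidate_stats` layout, the `NONE` label, and the 16-bit baseline are assumptions made for illustration; they are not taken from the ModelOpt source.

```python
# Hypothetical layout: candidate_stats[layer][format_name] -> global cost.
# "NONE" is an assumed label for the unquantized (no-quant) candidate.
NO_QUANT = "NONE"


def budget_from_no_quant_costs(candidate_stats, effective_bits, baseline_bits=16):
    """Scale the total no-quant cost by effective_bits / baseline_bits."""
    total_no_quant_cost = sum(costs[NO_QUANT] for costs in candidate_stats.values())
    return total_no_quant_cost * effective_bits / baseline_bits


candidate_stats = {
    "decoder.layers.0.mlp.experts": {NO_QUANT: 4096, "FP8": 2048, "NVFP4": 1152},
    "decoder.layers.0.self_attention": {NO_QUANT: 1024, "FP8": 512, "NVFP4": 288},
}
budget = budget_from_no_quant_costs(candidate_stats, effective_bits=8.0)
print(budget)
```

Because the costs in this sketch are already global per-candidate values, a sharded expert layer contributes the same amount on every rank, which is the property the PR description relies on.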
### Usage
```python
model, search_state = mtq.auto_quantize(
model,
constraints={"effective_bits": 8.0},
quantization_formats=[mtq.NVFP4_DEFAULT_CFG, mtq.FP8_DEFAULT_CFG],
data_loader=data_loader,
forward_step=forward_step,
)
```
### Testing
- Focused Megatron EP test from local log: `python -m pytest
tests/gpu_megatron/torch/quantization/plugins/test_megatron.py::test_auto_quantize_moe_ep
-xvs` in NGC PyTorch 26.01 (`1 passed` in 134.37s).
- Added unit coverage for deriving the auto_quantize budget from
no-quant candidate costs.
### Before your PR is "*Ready for review*"
Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).
Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).
- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A
- Did you get Claude approval on this PR?: N/A
### Additional Information
Base branch: `jennifchen/super_nvfp4_recipe`.
Signed-off-by: realAsma <akuriparambi@nvidia.com>
Signed-off-by: Jenny Chen <jennifchen@nvidia.com>
Co-authored-by: Jenny Chen <jennifchen@nvidia.com>
Caution: Review failed. Failed to post review comments.

📝 Walkthrough

This PR extends NVFP4 static-block quantization with calibration validation and state restoration; adds distributed expert-parallelism support to auto-quantize, including format consensus across EP ranks; implements per-layer mixed-precision quantization metadata recording and export for Megatron-Core models targeting Hugging Face format; and introduces HF Hub offline mode support with two new Nemotron-3-Super-120B PTQ recipe configurations.

Changes

- NVFP4 Static Quantizer and Expert Parallelism Integration
- Comprehensive Test Coverage

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks: ✅ 5 passed | ❌ 1 failed (1 warning)
A continuation of #1363
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
modelopt/torch/export/unified_export_megatron.py (1)
818-828: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Treat `QUANTIZATION_NONE` as unquantized when building `exclude_modules`.
This branch only records excludes for `qformat is None`, but the same method immediately returns early on `qformat == QUANTIZATION_NONE`, and `_qkv_slicing()` already treats both values the same. As written, any normal module reported as `QUANTIZATION_NONE` will skip the HF ignore list even though it is still unquantized.
Suggested fix
-    if qformat is None and "norm" not in prefix:
+    if qformat in (None, QUANTIZATION_NONE) and "norm" not in prefix:
         self._record_excluded_module(prefix)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/export/unified_export_megatron.py` around lines 818 - 828, The code currently only calls _record_excluded_module(prefix) when qformat is None, but QUANTIZATION_NONE should be treated the same; update the branch in unified_export_megatron.py (the block around qformat, QUANTIZATION_NONE, _get_weight_bias, and _record_excluded_module) so that if qformat is None or qformat == QUANTIZATION_NONE (and "norm" not in prefix) you record the module as excluded before the early return; keep the existing early return for QUANTIZATION_NONE but ensure the exclude is recorded first and keep compatibility with _qkv_slicing behavior.

modelopt/torch/quantization/algorithms.py (1)
765-782: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Recompute the persisted score/cost after recipe synchronization.
After `best_format` is replaced by the DP/TP/EP-synchronized value, `best_constraints` and `best_scores` are still accumulated from the local solver choice. On ranks that did not originate the synchronized format, `self.best["constraints"]` / `self.best["score"]` can end up describing a different recipe than the one actually activated and checkpointed.
Suggested fix
     for name, best_hparam_recipe_info in best_recipe_info.items():
         # Solvers could give different solutions for the same layer across DP/TP/EP groups even though
         # the scores and costs are the same. Lets make sure the same recipe is selected across DP/TP/EP
         _ps = self.model.get_submodule(name.split(".quant_recipe")[0]).parallel_state
         best_format = DistributedProcessGroup.get_dist_syncd_obj(
             best_hparam_recipe_info["format"],
             [
                 _ps.data_parallel_group,
                 _ps.tensor_parallel_group,
                 _ps.expert_model_parallel_group,
             ],
             lambda a: a[0],
         )
         best_recipe[name] = best_format
-        get_hparam(self.model, name).active = best_format
-        best_constraints += best_hparam_recipe_info["costs"]
-        best_scores += best_hparam_recipe_info["scores"]
+        hparam = get_hparam(self.model, name)
+        hparam.active = best_format
+        best_constraints += hparam.get_cost(best_format)
+        best_scores += hparam.get_score(best_format)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/quantization/algorithms.py` around lines 765 - 782, The loop currently accumulates best_constraints and best_scores from best_hparam_recipe_info before replacing the local solver's format with the DP/TP/EP-synchronized best_format; update the code so that after you set best_recipe[name] = best_format and get_hparam(self.model, name).active = best_format you recompute and add the costs and scores that correspond to the actually activated best_format (not the original best_hparam_recipe_info["format"]); locate the mapping of format->costs/scores that the solver produced for the layer (referencing best_recipe_info, best_hparam_recipe_info and get_hparam) and use that entry to increment best_constraints and best_scores (and keep self.best["constraints"]/self.best["score"] consistent with the activated recipe).
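The concern in the `algorithms.py` comment above can be sketched without any distributed runtime. The example below simulates ranks as list positions, adopts the first rank's format for each layer (mirroring the `lambda a: a[0]` reducer quoted in the suggested fix), and only then accumulates cost and score for the format that was actually activated. The dictionary layouts and the function name are hypothetical.

```python
def sync_then_accumulate(per_rank_choice, per_format_stats):
    """Adopt rank 0's format per layer, then total the stats of that format.

    per_rank_choice:  {layer: [format chosen by rank 0, rank 1, ...]}  (hypothetical)
    per_format_stats: {layer: {format: {"cost": ..., "score": ...}}}   (hypothetical)
    """
    synced = {layer: choices[0] for layer, choices in per_rank_choice.items()}
    total_cost = sum(per_format_stats[layer][fmt]["cost"] for layer, fmt in synced.items())
    total_score = sum(per_format_stats[layer][fmt]["score"] for layer, fmt in synced.items())
    return synced, total_cost, total_score


per_rank_choice = {"layer_a": ["FP8", "NVFP4"], "layer_b": ["NVFP4", "NVFP4"]}
per_format_stats = {
    "layer_a": {"FP8": {"cost": 2, "score": 0.25}, "NVFP4": {"cost": 1, "score": 0.5}},
    "layer_b": {"FP8": {"cost": 4, "score": 0.25}, "NVFP4": {"cost": 2, "score": 0.75}},
}
synced, total_cost, total_score = sync_then_accumulate(per_rank_choice, per_format_stats)
print(synced, total_cost, total_score)
```

In this toy setup rank 1 locally preferred `NVFP4` for `layer_a`; accumulating from its local choice would have reported a cost of 3 instead of the 4 that matches the synchronized recipe.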
🧹 Nitpick comments (1)
tests/unit/torch/quantization/test_nvfp4_static_export_cpu.py (1)
32-42: ⚡ Quick win
Add one regression that uses only the restored `_global_amax` path.
The implementation change specifically supports static quantizers restored with `_global_amax`, but this helper only seeds `global_amax`, so the new restore path is still untested. A single round-trip case that sets `_global_amax` directly would keep the actual bugfix from regressing.
As per coding guidelines, `tests/**/*.py`: Write focused unit tests during development and curate production tests to be lean, documenting expected behavior, protecting against regressions, and flagging backward-incompatible changes.
Also applies to: 45-70
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/torch/quantization/test_nvfp4_static_export_cpu.py` around lines 32 - 42, Add a focused unit test that exercises the restored _global_amax code path: create an NVFP4StaticQuantizer via the existing helper _make_static_quantizer (or directly instantiate NVFP4StaticQuantizer), set the private attribute _global_amax (not global_amax) to a tensor value, perform the export/import (or the same round‑trip flow used elsewhere in this test file) and assert the quantizer restores using the _global_amax path (e.g., resulting amax/global_amax behavior matches expected values). Ensure the test is small, documents the expected behavior, and only validates the single round‑trip regression scenario so the `_global_amax` restore remains covered.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 864054fb-7e5a-459d-9bc8-f15b0be42e2b
📒 Files selected for processing (26)
- CHANGELOG.rst
- examples/specdec_bench/specdec_bench/datasets/speed.py
- modelopt/torch/export/plugins/hf_checkpoint_utils.py
- modelopt/torch/export/plugins/mcore_nemotron.py
- modelopt/torch/export/quant_utils.py
- modelopt/torch/export/unified_export_megatron.py
- modelopt/torch/quantization/algorithms.py
- modelopt/torch/quantization/backends/utils.py
- modelopt/torch/quantization/config.py
- modelopt/torch/quantization/conversion.py
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/nn/modules/tensor_quantizer.py
- modelopt/torch/quantization/plugins/custom.py
- modelopt/torch/quantization/plugins/megatron.py
- modelopt/torch/quantization/qtensor/nvfp4_tensor.py
- modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-max-calib.yaml
- modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml
- tests/_test_utils/torch/quantization/quantize_common.py
- tests/gpu/torch/quantization/test_nvfp4_static_quantizer_cuda.py
- tests/gpu_megatron/torch/export/test_unified_export_megatron.py
- tests/gpu_megatron/torch/quantization/plugins/test_megatron.py
- tests/unit/torch/export/test_hf_checkpoint_utils.py
- tests/unit/torch/quantization/plugins/test_fused_experts.py
- tests/unit/torch/quantization/test_autoquant.py
- tests/unit/torch/quantization/test_mse_calibrator.py
- tests/unit/torch/quantization/test_nvfp4_static_export_cpu.py
    - quantizer_name: '*mixer.experts.*weight_quantizer'
      enable: true
      cfg:
        block_sizes:
          -1: 16
          type: dynamic
          scale_bits: e4m3
        num_bits: e2m1
    - quantizer_name: '*mixer.experts.*input_quantizer'
      enable: true
      cfg:
        block_sizes:
          -1: 16
          type: dynamic
          scale_bits: e4m3
        num_bits: e2m1
    # Megatron-Core/PTQ names: decoder.layers.*.mlp.experts.local_experts.*.linear_fc{1,2}.
    - quantizer_name: '*mlp.experts*weight_quantizer'
      enable: true
      cfg:
        block_sizes:
          -1: 16
          type: dynamic
          scale_bits: e4m3
        num_bits: e2m1
    - quantizer_name: '*mlp.experts*input_quantizer'
      enable: true
      cfg:
        block_sizes:
          -1: 16
          type: dynamic
          scale_bits: e4m3
        num_bits: e2m1
Keep routed-expert weight blocks static in the max-calib variant.
This recipe says it differs from the MSE recipe by calibration method, but routed-expert weight quantizers are set to type: dynamic, which changes quantization behavior and undermines the max-vs-MSE comparison.
Proposed fix
    - quantizer_name: '*mixer.experts.*weight_quantizer'
      enable: true
      cfg:
        block_sizes:
-         type: dynamic
+         type: static
          scale_bits: e4m3
        num_bits: e2m1
@@
    - quantizer_name: '*mlp.experts*weight_quantizer'
      enable: true
      cfg:
        block_sizes:
-         type: dynamic
+         type: static
          scale_bits: e4m3
        num_bits: e2m1
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-max-calib.yaml`
around lines 47 - 79, The routed-expert weight quantizers in this max-calib
recipe (entries with quantizer_name: '*mixer.experts.*weight_quantizer' and
'*mlp.experts*weight_quantizer') are set to type: dynamic but must be static for
a fair max-vs-MSE comparison; update those two quantizer blocks to use type:
static (leave the corresponding input_quantizer blocks as-is) so only the weight
quantizers for routed experts switch from dynamic to static.
    # Calibration: weight MSE with FP8-scale sweep over the 128 e4m3 scale values
    # (NVFP4 weights use static block scales selected by MSE; FP8 per-tensor scales
    # are also chosen via MSE search instead of plain amax).
Update the calibration comment for FP8 layers.
The comment says FP8 per-tensor scales are selected via MSE search, but this stack skips MSE for non-NVFP4 formats. This is misleading for recipe users.
Proposed fix
-# (NVFP4 weights use static block scales selected by MSE; FP8 per-tensor scales
-# are also chosen via MSE search instead of plain amax).
+# (NVFP4 routed-expert weights use static block scales selected by MSE;
+# non-NVFP4 layers, such as FP8 per-tensor, follow the non-MSE path.)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml` around
lines 30 - 32, The calibration comment is misleading about FP8 scale selection:
update the comment near the calibration block that mentions "FP8 per-tensor
scales" and "NVFP4 weights" (the lines describing MSE searches) to explicitly
state that only NVFP4 weight block scales are selected via MSE while non-NVFP4
FP8 formats skip MSE and use the stack's default scaling method; edit the text
to clarify that FP8 per-tensor scales for non-NVFP4 are not MSE-searched to
avoid confusion for recipe users.
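To make the max-versus-MSE distinction in the two recipes concrete, here is a minimal pure-Python sketch of choosing one block scale for e2m1 (FP4) values. It is an illustration under simplifying assumptions: the candidate scales are evenly spaced fractions of the amax-derived scale rather than the 128 e4m3 values mentioned in the recipe comment, and all function names are hypothetical.

```python
# Non-negative magnitudes representable in e2m1 (FP4).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def fake_quant(x, scale):
    """Round |x| / scale to the nearest grid magnitude, then rescale with the sign."""
    mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
    return (mag if x >= 0 else -mag) * scale


def block_mse(block, scale):
    """Mean squared error between a block and its fake-quantized version."""
    return sum((x - fake_quant(x, scale)) ** 2 for x in block) / len(block)


def max_calib_scale(block):
    """Max calibration: map the block's largest magnitude onto the top grid value."""
    return max(abs(x) for x in block) / E2M1_GRID[-1]


def mse_calib_scale(block, num_candidates=32):
    """MSE calibration: sweep candidate scales and keep the lowest-error one.

    Assumption: candidates are fractions of the max-calib scale (the last
    candidate equals it), not the e4m3 sweep used by the actual recipe.
    """
    base = max_calib_scale(block)
    candidates = [base * (i + 1) / num_candidates for i in range(num_candidates)]
    return min(candidates, key=lambda s: block_mse(block, s))


# One block of 16 weights with a single outlier.
block = [0.1, -0.2, 0.05, 0.3, -0.15, 0.25, 0.0, 0.12,
         -0.08, 0.22, 0.18, -0.27, 0.02, 0.09, -0.11, 2.4]
max_scale = max_calib_scale(block)
mse_scale = mse_calib_scale(block)
print(block_mse(block, max_scale), block_mse(block, mse_scale))
```

Since the sweep includes the max-calib scale itself, the MSE-selected scale can never be worse than max calibration on the calibrated block, which is why a fair comparison needs both recipes to use the same (static) weight block type.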
    block_sizes = getattr(quantizer, "block_sizes", None)
    block_size = block_sizes.get(-1) if isinstance(block_sizes, dict) else None
    if block_size is None or weight.shape[-1] % block_size != 0:
        return False
    expected_blocks = weight.numel() // block_size
    return amax.numel() == expected_blocks and global_amax.numel() == 1
Handle padded trailing blocks when validating restored NVFP4 state.
Static block quantization already pads the tail block during setup, so a restored _amax can be complete even when weight.shape[-1] % block_size != 0. Returning False here forces max_calibrate() and overwrites the saved MSE-derived scales for those layers.
Suggested fix
     block_sizes = getattr(quantizer, "block_sizes", None)
     block_size = block_sizes.get(-1) if isinstance(block_sizes, dict) else None
-    if block_size is None or weight.shape[-1] % block_size != 0:
+    if block_size is None:
         return False
-    expected_blocks = weight.numel() // block_size
+    rows = weight.numel() // weight.shape[-1]
+    expected_blocks = rows * ((weight.shape[-1] + block_size - 1) // block_size)
     return amax.numel() == expected_blocks and global_amax.numel() == 1
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@modelopt/torch/quantization/plugins/custom.py` around lines 148 - 153, The
current check treats incomplete tail blocks as invalid; instead compute blocks
per row as ceil(weight.shape[-1] / block_size) and total expected_blocks =
(weight.numel() // weight.shape[-1]) * blocks_per_row so padded trailing blocks
count toward the expected amax length. In the validation around
quantizer.block_sizes / block_size, replace expected_blocks = weight.numel() //
block_size with rows = weight.numel() // weight.shape[-1]; blocks_per_row =
math.ceil(weight.shape[-1] / block_size) (or integer ceil via (N + block_size -
1)//block_size); expected_blocks = rows * blocks_per_row, then return
amax.numel() == expected_blocks and global_amax.numel() == 1, allowing restored
`_amax` that includes padded tail blocks.
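The block-count arithmetic proposed in this comment can be checked in isolation. The sketch below works on plain shape tuples rather than tensors; the function name is hypothetical, and it assumes blocks run along the last dimension as in the quoted code.

```python
def expected_block_count(weight_shape, block_size):
    """Count blocks along the last dim, counting a padded tail block per row."""
    last_dim = weight_shape[-1]
    rows = 1
    for dim in weight_shape[:-1]:
        rows *= dim
    blocks_per_row = (last_dim + block_size - 1) // block_size  # integer ceil
    return rows * blocks_per_row


# Evenly divisible case: ceil and floor agree.
print(expected_block_count((4, 32), 16))
# Ragged case: each row of 40 needs 3 blocks (16 + 16 + 8 padded to 16).
print(expected_block_count((4, 40), 16))
```

For the ragged `(4, 40)` shape the original `numel() // block_size` expression gives 10, while a per-row ceil gives 12, so a restored `_amax` with 12 entries would be rejected by the floor-based check.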
    leaves = (
        list(weight_quantizer)
        if isinstance(weight_quantizer, SequentialQuantizer)
        else [weight_quantizer]
    )
    if any(leaf.is_static_block_quant for leaf in leaves):
        offending.append((name, tp_group.world_size()))
    if offending:
        raise NotImplementedError(
            "Static-block NVFP4 weight quantization (e.g. MSE) is not supported with TP > 1. Please re-run with TP=1. "
            f"Offending modules (showing first 5 of {len(offending)}): {offending[:5]}"
        )
Narrow this TP guard to NVFP4-static weights only.
is_static_block_quant is true for every fake static-block format, not just NVFP4. With this predicate, TP>1 AWQ/INT4 block-quantized models now hit this new NotImplementedError, even though the message and PR scope are NVFP4/MSE-specific.
Suggested fix
-    if any(leaf.is_static_block_quant for leaf in leaves):
+    if any(leaf.is_nvfp4_static for leaf in leaves):
         offending.append((name, tp_group.world_size()))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@modelopt/torch/quantization/plugins/megatron.py` around lines 88 - 99, The
TP>1 guard is too broad because it triggers for any fake static-block quantizer;
change the check that builds offending to only consider NVFP4 static-block
quantizers by requiring both leaf.is_static_block_quant and that the leaf
reports the NVFP4 format (e.g., leaf.format == "NVFP4" or the project’s NVFP4
enum/attribute — replace with the actual attribute used in your quantizer
objects) when iterating over leaves (the variables/functions involved:
weight_quantizer, SequentialQuantizer, leaves, is_static_block_quant, offending,
tp_group.world_size()); keep the rest of the logic and the NotImplementedError
unchanged.
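The difference between the broad and the narrowed predicate can be shown with stand-in objects. In the sketch below, `FakeQuantizer`, its `fmt` field, and `find_offending` are hypothetical illustrations; only the attribute names `is_static_block_quant` and `is_nvfp4_static` come from the review comment, and their real definitions may differ.

```python
from dataclasses import dataclass


@dataclass
class FakeQuantizer:
    """Hypothetical stand-in for a weight quantizer leaf."""

    fmt: str
    is_static_block_quant: bool

    @property
    def is_nvfp4_static(self):
        # Assumed meaning: static-block quantization AND the NVFP4 format.
        return self.is_static_block_quant and self.fmt == "NVFP4"


def find_offending(named_leaves, tp_world_size, predicate):
    """Return module names whose leaves match the predicate when TP > 1."""
    if tp_world_size <= 1:
        return []
    return [name for name, leaves in named_leaves.items() if any(predicate(q) for q in leaves)]


named_leaves = {
    "experts.fc1": [FakeQuantizer("NVFP4", True)],
    "attn.qkv": [FakeQuantizer("INT4_AWQ", True)],
    "attn.proj": [FakeQuantizer("FP8", False)],
}
broad = find_offending(named_leaves, 2, lambda q: q.is_static_block_quant)
narrow = find_offending(named_leaves, 2, lambda q: q.is_nvfp4_static)
print(broad, narrow)
```

With the broad predicate the INT4 block-quantized module is also flagged under TP=2, which is the regression the reviewer is pointing at.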
     # Make sure config.json and hf_quant_config.json use the same serving config.
     assert config_dict["quantization_config"] == hf_quant_config_dict

     # Verify config.json
     if kv_cache_quant_cfg:
         assert config_dict["quantization_config"]["kv_cache_scheme"]["num_bits"] == 8

     # Verify hf_quant_config.json
     if quant_config:
-        quant_config_dict = hf_quant_config_dict["quantization"]
+        quant_config_dict = hf_quant_config_dict
         quant_type = quant_config_dict["quant_algo"]
         assert (
             quant_type in quant_config
         )  # quant config str is subset of quant config e.g. NVFP4 -> NVFP4_DEFAULT_CFG
-        assert len(quant_config_dict["exclude_modules"]) > 1  # Dynamically added exclude modules
+        assert len(quant_config_dict["ignore"]) > 1  # Dynamically added exclude modules
         if quant_type == "NVFP4":
-            assert quant_config_dict["group_size"] == 16
+            assert quant_config_dict["config_groups"]["group_0"]["weights"]["group_size"] == 16

         if kv_cache_quant_cfg:
-            assert quant_config_dict["kv_cache_quant_algo"] == KV_CACHE_FP8
+            assert quant_config_dict["kv_cache_scheme"]["num_bits"] == 8
Assert against the converted serving config, not the raw HF wrapper.
hf_quant_config.json is still written as {"producer": ..., "quantization": ...}, while config.json["quantization_config"] gets the output of convert_hf_quant_config_format(...). This helper now compares unlike objects and then indexes quant_algo / ignore / config_groups at the wrong level, so the quantized export cases will fail or validate the wrong structure.
Suggested fix
+from modelopt.torch.export.convert_hf_config import convert_hf_quant_config_format
+
 def _verify_model_quant_config(
     export_dir: Path, quant_config: str | None = None, kv_cache_quant_cfg: str | None = None
 ):
     """Verify config.json and hf_quant_config.json"""
     config_dict = json.load(open(export_dir / "config.json"))
     hf_quant_config_dict = json.load(open(export_dir / "hf_quant_config.json"))
     # Make sure config.json and hf_quant_config.json use the same serving config.
-    assert config_dict["quantization_config"] == hf_quant_config_dict
+    assert config_dict["quantization_config"] == convert_hf_quant_config_format(
+        hf_quant_config_dict
+    )
     # Verify config.json
     if kv_cache_quant_cfg:
         assert config_dict["quantization_config"]["kv_cache_scheme"]["num_bits"] == 8
     # Verify hf_quant_config.json
     if quant_config:
-        quant_config_dict = hf_quant_config_dict
+        quant_config_dict = hf_quant_config_dict["quantization"]
         quant_type = quant_config_dict["quant_algo"]
As per coding guidelines, `tests/**/*.py`: Write focused unit tests during development and curate production tests to be lean, documenting expected behavior, protecting against regressions, and flagging backward-incompatible changes.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/gpu_megatron/torch/export/test_unified_export_megatron.py` around lines
45 - 65, The test is comparing config.json's quantization_config to the raw HF
wrapper (hf_quant_config_dict) instead of the converted serving format; change
the test to use the converted structure (call convert_hf_quant_config_format on
hf_quant_config_dict or otherwise use the same transformation used when
producing config_dict) before asserting and before indexing fields like
"quant_algo", "ignore", and "config_groups"; update references in the
verification block so quant_config_dict refers to the converted result (not the
original hf_quant_config_dict) and then perform the existing assertions and
kv_cache checks against that converted object.
### What does this PR do?
Type of change: New recipe + Bug Fixes

MCore and MSE fixes
- `NVFP4QTensor` (not `TensorQuantizer`, which can call max calibrate; we want to skip max calibrate for static quantizers during restore) --> fixes a bug during MCore export for MSE
- `block_sizes` is dict-backed.
- `hf_quant_config.json`

Export bug fixes

Super recipe
Mirrors the published nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 `hf_quant_config.json`:
rest: not quantized
### Usage
    # Add a code snippet demonstrating how to use this
### Testing
TODO test in HF and MCore PTQ on Nemotron model
### Before your PR is "*Ready for review*"
Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
`CONTRIBUTING.md`: ✅ / ❌ / N/A
### Additional Information
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes